Lionel Hertzog
30.08.2021
Before jumping into the extra complexities of spatial models, it is worth reflecting when we do not need spatial models:
Formally if we have X a covariate spatially-varying and data y following:
\[ y \sim D(\beta X, ...) \]
With D being any statistical distribution. Then the spatial signal in the data will be captured by the model.
Suppose that we collected data with a clear spatial signal:
str(data)
'data.frame': 100 obs. of 4 variables:
$ x : num 33.62 45.98 4.65 22.13 3.95 ...
$ y : num 28.1 33.6 13 29.4 45.1 ...
$ covariate: num 0.755 0.408 0.327 0.959 0.98 ...
$ resp : num 5.75 2.36 1.29 7.66 7.86 ...
To assess whether the covariates sufficiently capture the spatial signal, one can check for spatial autocorrelation in the residuals using DHARMa:
m <- lm(resp ~ covariate, data)
dd <- simulateResiduals(m)
Biodiversity recording from citizen science programms often contain a spatial signal due to citizen reporting biodiversity records in specific areas:
In addition, citizen science biodiversity monitoring programs give informations on where a species was recorded (and is therefore assumed to be present), but an absence of records cannot be usually assumed to represent a species absence. This type of data is usually called presence-only.
One interesting class of models to fit to presence-only data with spatial signal is: point pattern analysis.
In point pattern analysis we have some georeferenced data and potentially some covariates.
The locations of the data depend on an underlying intensity function \( \lambda \)
Point pattern analysis will directly try to estimate the intensity function using:
\[ log(\lambda(s))) = \beta_0 + \beta_1 * covariate(s) + S(s) \]
INLA stands for Integrated Nested Laplace Approximation, it is an approximate Bayesian approach that offers a faster alternative to other Bayesian algorithms such as MCMC or HMC.
INLA can fit a wide range of models including:
Spatial effect in INLA are fitted using Stochastic Partial Differential Equations (SPDE)
Picture from Krainski et al.
The continuous spatial effect is estimated by using basis functions (like smooths) on a mesh which dramatically speed up the estimation of continuous spatial effects [more on this later].
Picture from Krainski et al.
The first step to fit spatial models in INLA is to define a mesh of the area of interest (the domain).
mesh <- inla.mesh.2d(loc.domain = domain, max.edge = 0.1)
The mesh plays an important role in estimating the spatial effects, the spatial effect will be estimated at the vertices / nodes of the triangles. As a result if the spatial effect to be estimated varies at smaller scale than the mesh triangles, the model will not be able to correctly estimate the spatial effect.
At the other extreme too precise mesh will lead to unnecessary long computing times.
It is usually recommended to start with coarser mesh to debug model code, explore the results and use finer meshes for final model runs.
There is also a ShinyApp built in INLA to help defining the mesh:
meshbuilder()
Given a mesh, to fit a spatial model in INLA/inlabru we need to define the SPDE and more specifically the priors for the standard deviation (sigma) and the scale (range) of the spatial effect.
There are various ways to define the SPDE, the recommended one is via Penalized Complexity priors that is parametrized as follow:
From Fulgstad et al 2019:
In the simulation study we observe good coverage of the equal-tailed 95% credible intervals when the prior satisfies \( P(\sigma > \sigma_0) = 0.05 \) and \( P(\rho < \rho_0) = 0.05 \), where \( \sigma_0 \) is between 2.5 to 40 times the true marginal standard deviation and \( \rho_0 \) is between 0.1 and 0.4 of the true range.
In other words if we assume a standard deviation of the spatial effect to be around 2 and a range of 1 we can set the spde in R:
spde <- inla.spde2.pcmatern(mesh,
# P(sigma > 10) = 0.05
prior.sigma = c(10, 0.05),
# P(range < 0.1) = 0.05
prior.range = c(0.1, 0.05))
Ok, but what do this sigma and range really represent?
Under the hood INLA uses the Matern correlation / covariance function to estimate the spatial field.
Remember that our spatial models are usually:
\[ y(s_i) \sim D(\mu(s_i), ...) \\ \mu(s_i) = \beta X + u(s_i) \]
Where \( s_i \) are the n locations where the data were sampled and u is a continuous spatial field that is commonly assumed to follow a multivariate normal distribution that can be parametrized as a gaussian process:
\[ u \sim GF(0, \Sigma) \]
The \( \Sigma \) matrix represent the correlation / covariance between the n observations, in other words if you have 100 locations, \( \Sigma \) has 100 rows and 100 columns.
Fortunately we do not have to individually estimate all of these parameters, we can use the Matern function to compute the correlation / covariance between all observations given their (eucledian) distance following:
\[ cov(s_i, s_j) = Matern(\|s_i - s_j\|, \sigma, \rho) \]
Where \( \sigma \) is the variance in the spatial effect and \( \rho \) the range.
rho controls how fast the spatial effect varies, smaller values represent faster changes, larger values slower changes. In other words smaller rho values leads to small scale variations.
sigma control the amplitude of the spatial effect, how high are the peaks and how low are the troughs.
mesh <- inla.mesh.2d(loc = data, max.edge = 0.1)
spde <- inla.spde2.pcmatern(mesh,
prior.sigma = c(10, 0.05),
prior.range = c(0.1, 0.05))
formula <- coordinates ~ Intercept +
spat(coordinates, model = spde)
mod <- lgcp(components = formula,
data = spdf,
samplers = spol)
Where spdf is a SpatialPointDataFrame of the observations and spol is a a SpatialPolygonDataFrame of the domain boundary.
range <- spde.posterior(mod, "spat", "range")
sigma <- spde.posterior(mod, "spat", "variance")
matern <- spde.posterior(mod, "spat", "matern.correlation")